First we import the packages that will be used.
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import AdaBoostClassifier
import tensorflow as tf
from tensorflow import keras
For this project I looked at many different solutions to this problem and drew ideas for cleaning the data from them. The projects I used for reference are:
https://github.com/ngbolin/PortoSeguroXGB/blob/master/Porto%20Seguro%20Prediction.ipynb
https://github.com/Jihenghuang/kaggle-porto-seguro/blob/master/porto-seguro-jiheng.ipynb
First we read in the data. Since the dataset indicates a missing value with -1, we mark all -1 entries as missing. We do that while reading in the data by passing na_values=-1.
trainingData = pd.read_csv("~/Data/train.csv",na_values=-1)
testingData = pd.read_csv("~/Data/test.csv",na_values=-1)
trainingData.info()
testingData.info()
Our first step is to take a look at the data and analyse which variables we should keep and how to deal with missing values. We make a separate dataset containing the columns that have missing values and print out a table of the number of missing values per column.
missingData = trainingData.columns[trainingData.isnull().any()].tolist()
missingDataset = trainingData[missingData]
missingDataset.info()
numberOfNas = missingDataset.isnull().sum().sort_values(ascending = False)
numberOfNas = pd.DataFrame({"Missing values": numberOfNas})
numberOfNas.head(13)
numberOfNasPercentage = missingDataset.isnull().sum().sort_values(ascending = False) / len(trainingData)
numberOfNasPercentage = pd.DataFrame({"Missing values": numberOfNasPercentage})
numberOfNasPercentage.head(13)
After analysing the missing values it can be seen that two variables have more than 40 percent of their values missing. These variables will be removed, since it would be hard to impute such a large share of their values reliably.
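The drop can also be done programmatically. A minimal sketch of a helper (hypothetical, not part of the analysis above) that removes any column with more than 40 percent missing values:

```python
import pandas as pd
import numpy as np

def drop_mostly_missing(df, threshold=0.4):
    """Return a copy of df without columns whose share of NaNs exceeds threshold."""
    missing_share = df.isnull().mean()          # fraction of NaNs per column
    keep = missing_share[missing_share <= threshold].index
    return df[keep].copy()

demo = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],  # 80% missing
    "ok": [1, 2, np.nan, 4, 5],                                # 20% missing
})
cleaned = drop_mostly_missing(demo)
# "mostly_missing" is dropped, "ok" is kept
```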
Next we make a correlation plot of all the variables to analyse them further.
plt.rcParams.update({'font.size': 40})
corrMatrix = trainingData.corr()
plt.figure(figsize=(60,60))
plt.title("Correlation Plot")
sns.heatmap(corrMatrix,cmap='Set3',cbar_kws={'label': 'Correlation coefficient'})
trainingData1 = trainingData.copy()
testingData1 = testingData.copy()
From the plot above it can be seen that the variables with "calc" in their names are not correlated with anything, so they do not seem important as predictors. They will all be removed to save computing time.
Now we remove the variables that won't be used:
trainingData1 = trainingData1.drop(['ps_car_03_cat', 'ps_car_05_cat'], axis = 1)
testingData1 = testingData1.drop(['ps_car_03_cat', 'ps_car_05_cat'], axis = 1)
missingDataset = missingDataset.drop(['ps_car_03_cat', 'ps_car_05_cat'], axis = 1)
dropVars = trainingData1.columns[trainingData1.columns.str.startswith('ps_calc')]
trainingData1 = trainingData1.drop(dropVars, axis=1)
testingData1 = testingData1.drop(dropVars, axis=1)
To deal with missing values we will try three methods. We have both continuous and categorical variables with missing values. The first method is to fill the continuous variables with the median value and the categorical variables with the most common value.
The second method is to fill the continuous variables with the median value and to add a separate category for missing values to the categorical variables.
The third method is to interpolate the continuous variables so their distribution will be more even.
These methods will be compared by fitting a model to each resulting dataset and comparing the results. If methods 2 and 3 both beat method 1, we will interpolate the continuous variables and add a separate missing-value category to the categorical variables.
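To make the three strategies concrete, here is a minimal sketch on toy data (the series here are hypothetical, not columns from the dataset):

```python
import pandas as pd
import numpy as np

s_cat = pd.Series([1.0, 1.0, 2.0, np.nan])    # categorical-style column
s_cont = pd.Series([1.0, np.nan, 3.0, 4.0])   # continuous column

# Method 1: most common value for categorical, median for continuous
m1_cat = s_cat.fillna(s_cat.mode()[0])        # NaN -> 1.0 (the mode)
m1_cont = s_cont.fillna(s_cont.median())      # NaN -> 3.0 (median of 1, 3, 4)

# Method 2: sentinel category for categorical (median again for continuous)
m2_cat = s_cat.fillna(-1)                     # NaN -> -1, a new "missing" category

# Method 3: linear interpolation for continuous
m3_cont = s_cont.interpolate(method="linear") # NaN -> 2.0 (midpoint of 1 and 3)
```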
missingCatVars = missingDataset.columns[missingDataset.columns.str.endswith('cat')]
missingContVars = missingDataset.columns[~missingDataset.columns.str.endswith('cat')]
trainingData2 = trainingData1.copy()
testingData2 = testingData1.copy()
trainingData3 = trainingData1.copy()
testingData3 = testingData1.copy()
Below we fill the missing values with different methods.
Method 1:
for col in missingCatVars:
    trainingData1[col].fillna(value=trainingData1[col].mode()[0], inplace=True)
    testingData1[col].fillna(value=testingData1[col].mode()[0], inplace=True)
for col in missingContVars:
    trainingData1[col].fillna(value=trainingData1[col].median(), inplace=True)
    testingData1[col].fillna(value=testingData1[col].median(), inplace=True)
Method 2:
for col in missingCatVars:
    trainingData2[col].fillna(value=-1, inplace=True)
    testingData2[col].fillna(value=-1, inplace=True)
for col in missingContVars:
    trainingData2[col].fillna(value=trainingData2[col].median(), inplace=True)
    testingData2[col].fillna(value=testingData2[col].median(), inplace=True)
Method 3:
for col in missingCatVars:
    trainingData3[col].fillna(value=trainingData3[col].mode()[0], inplace=True)
    testingData3[col].fillna(value=testingData3[col].mode()[0], inplace=True)
for col in missingContVars:
    trainingData3[col].interpolate(method='linear', inplace=True)
    testingData3[col].interpolate(method='linear', inplace=True)  # fill the test set as well
To see the difference between the methods, let's look at histograms of the variables. The biggest difference shows up for method 3, where the continuous variables have a more balanced distribution.
#Drop the id column first; it is not relevant in the plots
trainingData1 = trainingData1.drop(["id"], axis=1)
trainingData2 = trainingData2.drop(["id"], axis=1)
trainingData3 = trainingData3.drop(["id"], axis=1)
Method 1:
myFigure = trainingData1.hist(bins=50,figsize=(100,100))
plt.show()
Method 2:
myFigure2 = trainingData2.hist(bins=50,figsize=(100,100))
plt.show()
Method 3:
myFigure3 = trainingData3.hist(bins=50,figsize=(100,100))
plt.show()
Add dummy variables for all categorical variables in all datasets:
dummyVars = trainingData1.columns[trainingData1.columns.str.endswith('cat')]
dummyVars
trainingData1 = pd.get_dummies(trainingData1, columns=dummyVars)
testingData1 = pd.get_dummies(testingData1, columns=dummyVars)
trainingData2 = pd.get_dummies(trainingData2, columns=dummyVars)
testingData2 = pd.get_dummies(testingData2, columns=dummyVars)
trainingData3 = pd.get_dummies(trainingData3, columns=dummyVars)
testingData3 = pd.get_dummies(testingData3, columns=dummyVars)
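One caveat with calling pd.get_dummies separately on the training and testing data: if a category appears in only one of the two frames, the resulting dummy columns will not match and the fitted model cannot be applied to the test set. A minimal sketch of reconciling the columns with DataFrame.align (the frames and column names here are hypothetical):

```python
import pandas as pd

train = pd.DataFrame({"ps_x_cat": [0, 1, 2], "target": [0, 1, 0]})
test = pd.DataFrame({"ps_x_cat": [0, 1]})   # category 2 never appears in test

train_d = pd.get_dummies(train, columns=["ps_x_cat"])
test_d = pd.get_dummies(test, columns=["ps_x_cat"])

# Align on columns; dummy columns absent from one frame are filled with 0
train_d, test_d = train_d.align(test_d, join="outer", axis=1, fill_value=0)
# align also copied 'target' (filled with 0) into the test frame; drop it
test_d = test_d.drop(["target"], axis=1)
```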
Make the data ready for fitting:
modelData = trainingData1.copy()
testData = testingData1.drop(['id'],axis = 1)
response = trainingData1['target']
modelData = modelData.drop(['target'],axis = 1)
modelData2 = trainingData2.copy()
testData2 = testingData2.drop(['id'], axis=1)
response2 = trainingData2['target']
modelData2 = modelData2.drop(['target'], axis=1)
modelData3 = trainingData3.copy()
testData3 = testingData3.drop(['id'], axis=1)
response3 = trainingData3['target']
modelData3 = modelData3.drop(['target'], axis=1)
Fit a logistic regression model for each method; this will be our baseline model.
Method 1:
#Solver added to model to silence a future warning about change of solver
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression(solver='liblinear')
logmodel.fit(modelData,response)
Method 2:
from sklearn.linear_model import LogisticRegression
logmodel2 = LogisticRegression(solver='liblinear')
logmodel2.fit(modelData2,response2)
Method 3:
from sklearn.linear_model import LogisticRegression
logmodel3 = LogisticRegression(solver='liblinear')
logmodel3.fit(modelData3,response3)
Compare all the models with cross-validation, using the ROC AUC score as the evaluation metric:
from sklearn.model_selection import cross_val_score
score1 = cross_val_score(logmodel, modelData, response, cv=5, scoring="roc_auc")
score1
from sklearn.model_selection import cross_val_score
score2 = cross_val_score(logmodel2, modelData2, response2, cv=5, scoring="roc_auc")
score2
from sklearn.model_selection import cross_val_score
score3 = cross_val_score(logmodel3, modelData3, response3, cv=5, scoring="roc_auc")
score3
Method 1:
np.mean(score1)
Method 2:
np.mean(score2)
Method 3:
np.mean(score3)
The results indicate that the data with a separate category for the missing values gives the best results; this is method 2. Method 3 showed no improvement over method 1, so we will keep the continuous variables filled with the median values. We will use the data filled with method 2 to fit further models, with logistic regression as the baseline model.
Finally, we fit the model on the unseen test dataset and upload a submission to Kaggle with our baseline model. The public score was 0.258.
myIds = testingData['id']
testResults = logmodel2.predict_proba(testData2)
submission = pd.DataFrame( { 'id': myIds , 'target': testResults[:,1]} )
submission = submission[['id', 'target']]
submission.to_csv("submission1.csv", index = False)
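Note that the public leaderboard score is not directly comparable to the cross-validated ROC AUC: this competition was scored with the normalized Gini coefficient, which for a binary classifier relates to AUC as Gini = 2·AUC − 1. A small sketch of the conversion on toy labels:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gini_from_auc(auc):
    """Normalized Gini coefficient corresponding to an ROC AUC score."""
    return 2 * auc - 1

# Toy example: 3 of the 4 positive/negative pairs are ranked correctly
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
auc = roc_auc_score(y_true, y_score)   # 0.75
gini = gini_from_auc(auc)              # 0.5
```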
Next up we fit a random forest classifier; we start with 200 estimators and a max depth of 6.
#Fit random forest model
from sklearn.ensemble import RandomForestClassifier
ranforClass = RandomForestClassifier(n_estimators = 200, max_depth = 6)
ranforClass.fit(modelData2, response2)
import numpy as np
from sklearn.model_selection import cross_val_score
myScores2 = cross_val_score(ranforClass, modelData2, response2, scoring="roc_auc", cv=5)
myScores2
np.mean(myScores2)
This model does not beat the ROC AUC score of our baseline model. We will try to fine-tune it by selecting different values for the max depth and the number of estimators. My laptop does not have great computing power, so I did not pick many different values and used RandomizedSearchCV instead of GridSearchCV.
from sklearn.model_selection import RandomizedSearchCV
myParams = {"max_depth": [4, 6, 8, 10],
            "n_estimators": [10, 100, 200, 500]}
myClassifierModel = RandomForestClassifier()
grid_search = RandomizedSearchCV(myClassifierModel, myParams, cv=5, random_state=190030150, scoring='roc_auc', return_train_score=True)
grid_search.fit(modelData2,response2)
bestRfModel = grid_search.best_estimator_
bestRfModel
The best model returned has a max depth of 10 and 200 estimators. Let's estimate its ROC AUC score with cross-validation:
bestRfModel.fit(modelData2, response2)
import numpy as np
from sklearn.model_selection import cross_val_score
myScores3 = cross_val_score(bestRfModel, modelData2, response2, scoring="roc_auc", cv=5)
myScores3
np.mean(myScores3)
This is a slight improvement over the previous logistic model in cross-validation. Fitting the model on the unseen test dataset gives a score of 0.256, which is lower than our baseline score. This indicates that the model generalised slightly worse to the unseen dataset than the logistic regression.
myIds = testingData['id']
testResults = bestRfModel.predict_proba(testData2)
submission = pd.DataFrame( { 'id': myIds , 'target': testResults[:,1]} )
submission = submission[['id', 'target']]
submission.to_csv("submission2.csv", index = False)
submission.head(10)
Next up we try AdaBoost, a boosting algorithm. We fit it with 200 estimators and a learning rate of 0.5.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
boostModel = AdaBoostClassifier(n_estimators=200, algorithm="SAMME.R", learning_rate=0.5)
boostModel.fit(modelData2, response2)
import numpy as np
from sklearn.model_selection import cross_val_score
myScores4 = cross_val_score(boostModel, modelData2, response2, scoring="roc_auc", cv=5)
myScores4
np.mean(myScores4)
The ROC AUC score is even better than the random forest classifier's. Let's try to fine-tune it by trying different values for the hyperparameters, namely the learning rate and the number of estimators.
from sklearn.model_selection import RandomizedSearchCV
myParams2 = {"n_estimators": [10, 200, 500],
             "learning_rate": [0.01, 0.1, 0.5]}
boostModel2 = AdaBoostClassifier(algorithm="SAMME.R")
grid_search2 = RandomizedSearchCV(boostModel2, myParams2, cv = 3, random_state=190030150, scoring = 'roc_auc',return_train_score=True)
grid_search2.fit(modelData2,response2)
bestAdaModel = grid_search2.best_estimator_
bestAdaModel
After fine-tuning the hyperparameters we get the same values as in the beginning, so we will use that model. We upload the predictions to Kaggle and get a score of 0.267.
myIds = testingData['id']
testResults = boostModel.predict_proba(testData2)
submission = pd.DataFrame( { 'id': myIds , 'target': testResults[:,1]} )
submission = submission[['id', 'target']]
submission.to_csv("submission3.csv", index = False)